[Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP #4467
| Thanks for your contribution! | 
Pull Request Overview
This PR adds draft_logprobs support for the Speculative Decode MTP mode, enabling developers to capture intermediate prediction probabilities during speculative decoding. The primary goal is to enhance observability and debuggability of the speculative decoding process while maintaining full backward compatibility with existing OpenAI-compatible interfaces.
Key changes:
- Added include_draft_logprobs parameter to /v1/completions and /v1/chat/completions request APIs
- Introduced draft_logprobs field in response structures to carry intermediate draft token probabilities
- Extended token processing logic to handle and buffer draft logprobs separately from target logprobs in speculative decoding scenarios
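The request flag and response field listed above can be sketched with plain dataclasses. This is only an illustration of the opt-in shape: the class names, the `build_choice` helper, and the `False` default are assumptions, not the definitions in fastdeploy/entrypoints/openai/protocol.py.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class CompletionRequestSketch:
    """Hypothetical stand-in for the completion request model."""
    model: str
    prompt: str
    logprobs: Optional[int] = None
    # Assumed default: off, so existing clients see no change.
    include_draft_logprobs: bool = False


@dataclass
class CompletionChoiceSketch:
    """Hypothetical stand-in for one response choice."""
    text: str
    logprobs: Optional[List[Any]] = None
    draft_logprobs: Optional[List[Any]] = None


def build_choice(request, text, target_logprobs, draft_logprobs):
    # Draft logprobs are attached only when the caller opted in.
    return CompletionChoiceSketch(
        text=text,
        logprobs=target_logprobs if request.logprobs is not None else None,
        draft_logprobs=draft_logprobs if request.include_draft_logprobs else None,
    )
```

With the flag left at its assumed default, `draft_logprobs` stays `None`, which is one way the response could remain unchanged for existing clients.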
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description | 
|---|---|
| fastdeploy/entrypoints/openai/protocol.py | Added include_draft_logprobs request parameter and draft_logprobs response field to protocol definitions | 
| fastdeploy/entrypoints/openai/serving_completion.py | Implemented draft logprobs aggregation and processing logic for completion endpoints | 
| fastdeploy/entrypoints/openai/serving_chat.py | Implemented draft logprobs aggregation and processing logic for chat endpoints | 
| fastdeploy/engine/request.py | Extended CompletionOutput and RequestOutput classes with draft logprobs support and output type tracking | 
| fastdeploy/output/token_processor.py | Core token processing logic updated to extract, buffer, and merge draft/target logprobs in speculative decoding mode | 
| tests/output/test_process_batch_output.py | Added unit test infrastructure for validating speculative decoding with logprobs | 
    if self.use_logprobs:
        mtype = int(self.output_tokens[1, 0].item())
        batch = self.output_tokens[2, 0]
        accept_num = [int(num[0]) for num in self.output_tokens[3 : batch + 3]]
        tokens = tokens[3 + MAX_BSZ : 3 + MAX_BSZ + batch * MAX_DRAFT_TOKENS * (K + 1)].reshape(
            [batch, MAX_DRAFT_TOKENS, K + 1]
        )

Copilot AI, Oct 17, 2025:
Magic numbers (3, K+1) are used without explanation. Consider extracting these as named constants like METADATA_HEADER_SIZE = 3 and TOPK_SIZE = K + 1 to improve code clarity and maintainability.
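A minimal sketch of what the named constants could look like, using a flat Python list in place of the real output tensor. `METADATA_HEADER_SIZE` and `TOPK_SIZE` are the names proposed in the comment; `TOKENS_OFFSET`, `parse_output_buffer`, the numeric values of `MAX_BSZ`, `MAX_DRAFT_TOKENS`, and `K`, and the reading of `K + 1` as "sampled token plus K candidates" are assumptions made for illustration.

```python
# Hypothetical sizes, chosen only to keep the example small.
MAX_BSZ = 4
MAX_DRAFT_TOKENS = 2
K = 3

METADATA_HEADER_SIZE = 3  # slot 1 = message type, slot 2 = batch size (slot 0 unused here)
TOPK_SIZE = K + 1         # assumed meaning: sampled token plus K top candidates
TOKENS_OFFSET = METADATA_HEADER_SIZE + MAX_BSZ  # token data starts after the accept counts


def parse_output_buffer(buf):
    """Split a flat buffer into (mtype, batch, accept_num, tokens)."""
    mtype = int(buf[1])
    batch = int(buf[2])
    accept_num = [int(n) for n in buf[METADATA_HEADER_SIZE:METADATA_HEADER_SIZE + batch]]
    row = MAX_DRAFT_TOKENS * TOPK_SIZE
    flat = buf[TOKENS_OFFSET:TOKENS_OFFSET + batch * row]
    # Manual equivalent of reshape([batch, MAX_DRAFT_TOKENS, TOPK_SIZE]).
    tokens = [
        [flat[b * row + d * TOPK_SIZE:b * row + (d + 1) * TOPK_SIZE]
         for d in range(MAX_DRAFT_TOKENS)]
        for b in range(batch)
    ]
    return mtype, batch, accept_num, tokens


# Build a buffer for a batch of 2 and parse it back.
buf = [0] * (TOKENS_OFFSET + MAX_BSZ * MAX_DRAFT_TOKENS * TOPK_SIZE)
buf[1], buf[2] = 3, 2
buf[3:5] = [1, 2]
buf[TOKENS_OFFSET:TOKENS_OFFSET + 16] = list(range(16))
mtype, batch, accept_num, tokens = parse_output_buffer(buf)
```

Naming the offsets once keeps the slice bounds and the reshape dimensions tied to the same definitions.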
    for target, decode in zip(self._batch_result_buffer, draft_batch_result):
        target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
        target_batch_result.append(target)
    self._batch_result_buffer = None

Copilot AI, Oct 17, 2025:
Potential issue if self._batch_result_buffer and draft_batch_result have different lengths. The zip operation will silently truncate to the shorter list, which could lead to data loss. Add length validation before the zip operation.
Suggested change:

    - for target, decode in zip(self._batch_result_buffer, draft_batch_result):
    -     target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
    -     target_batch_result.append(target)
    - self._batch_result_buffer = None
    + if len(self._batch_result_buffer) != len(draft_batch_result):
    +     llm_logger.error(
    +         f"Length mismatch: _batch_result_buffer ({len(self._batch_result_buffer)}) vs draft_batch_result ({len(draft_batch_result)}). Skipping postprocess for this batch."
    +     )
    + else:
    +     for target, decode in zip(self._batch_result_buffer, draft_batch_result):
    +         target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
    +         target_batch_result.append(target)
    +     self._batch_result_buffer = None
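The length check can also be isolated in a small helper, sketched here under the assumption that each result exposes `outputs.draft_top_logprobs` as in the snippet. `merge_draft_logprobs` and the `SimpleNamespace` stand-ins are hypothetical, and raising `ValueError` instead of logging is a design choice of this sketch, not of the PR.

```python
from types import SimpleNamespace


def merge_draft_logprobs(target_results, draft_results):
    """Copy draft_top_logprobs from each draft result onto its target result."""
    # zip() stops at the shorter input, so check the lengths explicitly first.
    if len(target_results) != len(draft_results):
        raise ValueError(
            f"Length mismatch: {len(target_results)} target results "
            f"vs {len(draft_results)} draft results"
        )
    merged = []
    for target, draft in zip(target_results, draft_results):
        target.outputs.draft_top_logprobs = draft.outputs.draft_top_logprobs
        merged.append(target)
    return merged


def make_result(draft_top_logprobs):
    # Hypothetical stand-in for a RequestOutput-like object.
    return SimpleNamespace(outputs=SimpleNamespace(draft_top_logprobs=draft_top_logprobs))


targets = [make_result(None), make_result(None)]
drafts = [make_result("a"), make_result("b")]
merged = merge_draft_logprobs(targets, drafts)
```

Whether a mismatch should raise, log and skip, or log and clear the buffer depends on how the caller recovers, which the review thread leaves open.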
    if logprobs_res and logprobs_res.content is not None:
        logprob_contents.extend(logprobs_res.content)

    # draf_logprobs

Copilot AI, Oct 17, 2025:
Corrected spelling of 'draf_logprobs' to 'draft_logprobs'.
Suggested change:

    - # draf_logprobs
    + # draft_logprobs
    final_res = final_res_batch[idx]
    prompt_token_ids = prompt_batched_token_ids[idx]
    assert prompt_token_ids is not None
    prompt_text = request.prompt

Copilot AI, Oct 17, 2025:
Variable prompt_text is assigned here but immediately reassigned at line 568 if request.echo is true. Consider moving this assignment inside the else block of the echo condition to avoid unnecessary assignment.
    """
    try:
        self.cached_generated_tokens.put_results(batch_result)
        if self.cfg.speculative_config.method and self.use_logprobs:

Copilot AI, Oct 17, 2025:
[nitpick] The condition self.cfg.speculative_config.method and self.use_logprobs is checked here, but the same logic pattern appears multiple times. Consider extracting this into a property method like is_speculative_with_logprobs for better readability and maintainability.
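A minimal sketch of the suggested property, assuming the processor holds `cfg` and `use_logprobs` as in the snippet. The class name `TokenProcessorSketch` and the `SimpleNamespace` config are hypothetical stand-ins.

```python
from types import SimpleNamespace


class TokenProcessorSketch:
    """Hypothetical stand-in for the token processor."""

    def __init__(self, cfg, use_logprobs):
        self.cfg = cfg
        self.use_logprobs = use_logprobs

    @property
    def is_speculative_with_logprobs(self) -> bool:
        # True only when a speculative method is configured and logprobs are on.
        return bool(self.cfg.speculative_config.method) and bool(self.use_logprobs)


cfg = SimpleNamespace(speculative_config=SimpleNamespace(method="mtp"))
processor = TokenProcessorSketch(cfg, use_logprobs=True)
```

Call sites could then read `if self.is_speculative_with_logprobs:` instead of repeating the two-part condition.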
      llm_logger.info(
    -     f"Request: {task_id} finished, number of " f"generated tokens: {self.tokens_counter[task_id]}."
    +     f"Request: {task_id} finished, number of "
    +     f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"

Copilot AI, Oct 17, 2025:
[nitpick] Log message formatting is inconsistent with spacing around colons. Add spaces after colons for better readability: token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}.
Suggested change:

    - f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"
    + f"generated tokens: {self.tokens_counter[task_id]}, token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}"
    ] * MAX_DRAFT_TOKENS
    processor.speculative_stats_step = 0

    # processor._recycle_resources = Mock()

Copilot AI, Oct 17, 2025:
Commented-out code should be removed. If this is needed for future reference, document why it's commented or remove it entirely.
Suggested change:

    - # processor._recycle_resources = Mock()
Motivation
This PR adds draft_logprobs support to the MTP Speculative Decode mode and extends the OpenAI-compatible interface so that developers can obtain intermediate prediction probabilities during speculative decoding.

Modifications
1. New request parameter
- include_draft_logprobs (bool), accepted in the request body of /v1/completions and /v1/chat/completions; controls whether draft_logprobs is returned.
2. New response field
- draft_logprobs: included in the response when include_draft_logprobs=true.
3. Compatibility
- Compatibility with the existing logprobs field is kept.

Usage or Command
Example 1: /completions endpoint

    curl https://{ip}:{port}/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "your-model",
        "prompt": "Hello, world!",
        "logprobs": 5,
        "include_draft_logprobs": true
      }'

Example response:

    {
      "id": "cmpl-xxx",
      "object": "text_completion",
      "choices": [
        {
          "text": "Hello",
          "logprobs": [ ... ],
          "draft_logprobs": [ ... ]
        }
      ]
    }

Example 2: /chat/completions endpoint

    curl https://{ip}:{port}/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "your-model",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello, world!"}
        ],
        "logprobs": true,
        "top_logprobs": 5,
        "include_draft_logprobs": true
      }'

Example response:

    {
      "id": "chatcmpl-xxx",
      "object": "chat.completion",
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Hello",
            "logprobs": [ ... ],
            "draft_logprobs": [ ... ]
          }
        }
      ]
    }

Accuracy Tests
Regarding include_draft_logprobs: the new field is used only for observation and does not affect the model's output.

Checklist
- [Speculative Decoding]
- (pre-commit passed)
- include_draft_logprobs
- release branch if needed
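The /v1/completions request from the Usage section can also be sketched in Python with only the standard library. The helper names `build_completion_request` and `extract_logprobs` and the example base URL are assumptions. The response shape follows the example response in the description, not a verified server reply.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, model="your-model", top_logprobs=5):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "logprobs": top_logprobs,
        "include_draft_logprobs": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_logprobs(response_body):
    """Return (logprobs, draft_logprobs) from the first choice, if present."""
    choice = response_body["choices"][0]
    return choice.get("logprobs"), choice.get("draft_logprobs")


# Hypothetical address; replace with the actual {ip}:{port}.
request = build_completion_request("https://localhost:8000", "Hello, world!")
```

Sending it would be `urllib.request.urlopen(request)` followed by `json.load(...)` on the reply; that step is left out so the sketch runs without a server.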